We are constantly surrounded by information in today’s digital world, but not all of it is accurate. We rely on data to gauge whether something is true or false. At the same time, more and more data is being collected and stored, and the information it contains needs to be unlocked.
We rarely see this data in its raw form. You can imagine how confusing rows and rows of numbers would be to interpret. For this reason, we usually rely on data visualisation to make patterns and trends in data easier to see.
Data visualisation is the translation of data into visual representations, like charts and graphs, to communicate the information contained in the data.
Whilst this method simplifies the process of understanding data, it can also bend the truth and misrepresent information if care is not exercised.
An effective graph is a visualisation of data that unlocks and conveys information or a story from data. This may help verify or debunk beliefs or assumptions, answer questions about a topic/issue or guide further research.
The first element needed in data visualisation is … data!
When you start plotting the data, don’t expect your first or even third attempt to produce the final display. You will arrive at your preferred display after a few iterations.
What elements should a good plot contain?
I will illustrate these points in what follows.
We consider graduate admission figures for the autumn of 1973 at the University of California, Berkeley. The numbers, shown below, seem to imply that men applying for post-graduate studies were more likely than women to be admitted. It was argued that the difference was so large that it was unlikely to be due to chance. The data set is usually presented as follows:
| | Applicants | Admitted |
|---|---|---|
| Men | 8442 | 44% |
| Women | 4321 | 35% |
The data above has been aggregated across departments. However, it is known that men and women tended to apply to different departments (for example, women applied more often to psychology or English studies, men more often to engineering).
When examining the individual departments, it appeared that six out of 85 departments were significantly biased against men, whereas only four were significantly biased against women. The data from the six largest departments are listed below.
| Department | Applied_men | Admitted_men | Applied_women | Admitted_women |
|---|---|---|---|---|
| A | 825 | 62% | 108 | 82% |
| B | 560 | 63% | 25 | 68% |
| C | 325 | 37% | 593 | 34% |
| D | 417 | 33% | 375 | 35% |
| E | 191 | 28% | 393 | 24% |
| F | 373 | 6% | 341 | 7% |
We visualise this data set of categorical, non-ordinal data using a mosaic plot. Mosaic plots are useful for visualising proportions in more than two dimensions.
The height and length of each tile are proportional to the corresponding proportions in the margins. So, a very flat rectangle indicates, proportionally, very few applicants of the corresponding gender in a given department. A long rectangle in the admitted status indicates that, proportionally, it’s not so difficult to be accepted in the corresponding department. As we can see, there is no evidence for a discrimination case. In Departments A and B applicants are mainly male but in Department A, proportionally, more female than male applicants were admitted. Department F is very competitive and has a high rejection rate, which applies nearly equally to both female and male applicants.
This mosaic also shows the explanation: Selective departments have more female applicants. It’s easy to see since the departments are ordered by selectiveness. Departments A and B let in many applicants, but they’re mostly male. The reverse is true for the rest. This means that the overall female population takes big admittance hits in departments C through F, while lots of males get in via departments A and B.
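As a rough sketch of the quantities a mosaic plot encodes, the snippet below computes, for each department and gender, the share of that department's applicants and the admitted fraction. The numbers come from the table of the six largest departments; the function name `mosaic_tiles` and the dictionary layout are illustrative assumptions, not the code that produced the plot.

```python
# Applicants and admission rates for the six largest departments,
# copied from the table: (applied_men, rate_men, applied_women, rate_women).
departments = {
    "A": (825, 0.62, 108, 0.82),
    "B": (560, 0.63, 25, 0.68),
    "C": (325, 0.37, 593, 0.34),
    "D": (417, 0.33, 375, 0.35),
    "E": (191, 0.28, 393, 0.24),
    "F": (373, 0.06, 341, 0.07),
}


def mosaic_tiles(data):
    """Return the two proportions that size each tile (hypothetical helper).

    share_of_department: fraction of the department's applicants of that gender
    admitted_fraction:   fraction of those applicants who were admitted
    """
    tiles = {}
    for dept, (men, rate_men, women, rate_women) in data.items():
        total = men + women
        tiles[dept] = {
            "men": {"share_of_department": men / total,
                    "admitted_fraction": rate_men},
            "women": {"share_of_department": women / total,
                      "admitted_fraction": rate_women},
        }
    return tiles


tiles = mosaic_tiles(departments)
for dept, genders in tiles.items():
    for gender, t in genders.items():
        print(f"{dept} {gender:5s} share={t['share_of_department']:.2f} "
              f"admitted={t['admitted_fraction']:.2f}")
```

A small share (as for women in Department B) corresponds to a very flat rectangle, and a small admitted fraction (as in Department F) to a short admitted tile.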
One of the perils when studying associations between a variable of interest and a set of explanatory variables is overfitting. If we use too many explanatory variables, we may explain the observed values of the variable of interest very well but nothing else, and so our study will have little predictive value.
Problems also occur when relevant explanatory variables are ignored. It is possible that when one ignores a relevant variable one observes an effect and when the variable is considered the opposite effect is observed. This is called Simpson’s paradox. What we have explored with the Berkeley graduate admissions data is one of the best-known examples of Simpson’s paradox.
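The reversal can be checked with a little arithmetic. The sketch below pools the six largest departments from the table (so the pooled rates differ from the campus-wide 44% and 35%) and compares them with the department-level rates. The helper name `pooled_rate` is an assumption made for illustration.

```python
# (applied_men, rate_men, applied_women, rate_women) for the six largest
# departments, copied from the table.
departments = {
    "A": (825, 0.62, 108, 0.82),
    "B": (560, 0.63, 25, 0.68),
    "C": (325, 0.37, 593, 0.34),
    "D": (417, 0.33, 375, 0.35),
    "E": (191, 0.28, 393, 0.24),
    "F": (373, 0.06, 341, 0.07),
}


def pooled_rate(data, count_idx, rate_idx):
    """Admission rate after pooling all departments (hypothetical helper)."""
    applied = sum(row[count_idx] for row in data.values())
    admitted = sum(row[count_idx] * row[rate_idx] for row in data.values())
    return admitted / applied


men_rate = pooled_rate(departments, 0, 1)
women_rate = pooled_rate(departments, 2, 3)

# Departments in which women were admitted at a higher rate than men.
women_ahead = [d for d, (_, rm, _, rw) in departments.items() if rw > rm]

print(f"pooled rate, men:   {men_rate:.1%}")
print(f"pooled rate, women: {women_rate:.1%}")
print("women ahead in departments:", women_ahead)
```

Pooled over these six departments, men are admitted at roughly 45% and women at roughly 30%, yet women have the higher admission rate in four of the six departments. The pooled figure is a weighted average, and the weights (where each gender applied) differ between men and women.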
Let us visualise the Berkeley admissions data using a treemap.
Like mosaic plots, a treemap displays proportions by varying the area of rectangles. A treemap can display hundreds, or even thousands, of pieces of information. You must arrange your data elements hierarchically using categorical variables, in a way that is meaningful for the information you want to display. The data you need is a quantitative variable, which determines the area of each rectangle, and one or more categorical variables, which define the hierarchy.
In the Berkeley admission data, the quantitative variable is “Number admitted”. The categorical variables are “Department”, “Admission Status”, and “Gender”, with that order of nesting (i.e. Gender within Admission Status within Department).
Note that the nesting or hierarchy is not always unique (you could nest Admission Status within Gender instead, for example). Therefore, you must think about what information you want to display and which nesting is most suitable.
For example, you could have procurement in government departments and each department has individual projects with a cost. There is only one natural hierarchy here, namely project within department.
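To make the idea of nesting concrete, here is a minimal sketch that arranges the Berkeley figures as Gender within Admission Status within Department and computes the area of any rectangle as the sum of its children. The counts are reconstructed from the rounded percentages in the table, so they are approximate; `build_hierarchy` and `area` are hypothetical helper names.

```python
# (department, gender, applicants, admission rate), from the table of the
# six largest departments. Admitted counts are approximate because the
# rates in the table are rounded.
rows = [
    ("A", "Men", 825, 0.62), ("A", "Women", 108, 0.82),
    ("B", "Men", 560, 0.63), ("B", "Women", 25, 0.68),
    ("C", "Men", 325, 0.37), ("C", "Women", 593, 0.34),
    ("D", "Men", 417, 0.33), ("D", "Women", 375, 0.35),
    ("E", "Men", 191, 0.28), ("E", "Women", 393, 0.24),
    ("F", "Men", 373, 0.06), ("F", "Women", 341, 0.07),
]


def build_hierarchy(rows):
    """Nest Gender within Admission Status within Department."""
    tree = {}
    for dept, gender, applicants, rate in rows:
        admitted = round(applicants * rate)
        branch = tree.setdefault(dept, {"Admitted": {}, "Rejected": {}})
        branch["Admitted"][gender] = admitted
        branch["Rejected"][gender] = applicants - admitted
    return tree


def area(node):
    """Area of a rectangle: the leaf value, or the sum over its children."""
    if isinstance(node, dict):
        return sum(area(child) for child in node.values())
    return node


tree = build_hierarchy(rows)
print("Department A:", area(tree["A"]))
print("Department F, admitted:", area(tree["F"]["Admitted"]))
print("All departments:", area(tree))
```

Choosing a different nesting only changes how `build_hierarchy` groups the rows; the leaf values, and therefore the total area, stay the same.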
For discussion: Would you change anything in the treemap? Which plot do you think better conveys the information about the association between admissions and gender?
We will use data about Adults with HIV in Africa (estimated prevalence of HIV in percentage, ages 15-49) from Gapminder, 1990-2011.
The data consists of yearly HIV prevalence by country as well as income (GDP per capita, PPP$ inflation-adjusted) and population size.
For discussion: What would you change in this plot?
Income is a highly skewed variable as many countries have low to medium incomes and very few have very high incomes. Therefore, it is difficult to see the information contained in the scatter plots as the points are cluttered towards low income values.
We will apply a logarithmic (base 10) transformation to income. The logarithm is an increasing function and so the order in the x-axis will be preserved.
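The effect of the transformation can be illustrated numerically. The income values below are hypothetical, chosen only to mimic a skewed distribution; they are not Gapminder figures. The helper `axis_share` is also an assumption made for this sketch.

```python
import math

# Hypothetical GDP-per-capita values (not Gapminder data): five low to
# medium incomes and one very high income.
income = [400, 900, 1500, 3000, 8000, 45000]
log_income = [math.log10(x) for x in income]


def axis_share(values, upto):
    """Fraction of the axis range occupied by values[0..upto]."""
    return (values[upto] - values[0]) / (values[-1] - values[0])


# Share of the x-axis used by the five lowest incomes, before and after.
raw_share = axis_share(income, 4)
log_share = axis_share(log_income, 4)

print(f"raw scale: {raw_share:.0%} of the axis")
print(f"log scale: {log_share:.0%} of the axis")
print("order preserved:", log_income == sorted(log_income))
```

On the raw scale the five lowest values are squeezed into a small part of the axis; on the log scale they occupy most of it, and because the logarithm is increasing their order is unchanged.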
Most African countries have prevalence values on a scale about ten times that of the rest of the world. This makes the visualisation difficult. It’s best to visualise the data for African countries separately.
For discussion: Can you think of any other ways of dealing with highly skewed variables? What would you change in the above plot?
Let us view the plot for Africa only.
To gain more insight, let us identify the African countries with HIV prevalence greater than or equal to 10%. We add labels that do not overlap.
For discussion: What would you change in this plot?
I will make the background colour white and remove vertical grid lines. I will leave the frame around each plot because in faceted plots it’s good to know the boundaries of each plot.
We can also produce a dynamic plot showing one frame for each year. Follow Equatorial Guinea (gnq) in the bottom right and observe how the country becomes richer and its HIV prevalence increases.
For discussion: What would you change in this plot?
We can add a further variable, population size, with diameter of dots proportional to population size.
For discussion: What would you change in this plot?
Note that the legend title should be kept: without it, it is not clear why the dots have different sizes.
Let us follow the evolution of GDP and HIV prevalence in Equatorial Guinea.
A plain scatter plot is misleading because the points should be ordered by Year, not by GDP.
One way to add the time dimension when plotting two time series against each other is to add arrows indicating time evolution and time labels.
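A minimal sketch of how such arrows could be derived: sort the observations by year and connect each point to the next one. The records below are hypothetical and deliberately out of order; they are not the Gapminder values for Equatorial Guinea, and `arrow_segments` is an illustrative helper name.

```python
# Hypothetical (year, gdp, prevalence) records, deliberately unordered.
# These are NOT Gapminder values.
records = [
    (2003, 9000, 2.1),
    (2001, 5000, 1.5),
    (2002, 7000, 1.8),
    (2004, 8500, 2.4),
]


def arrow_segments(records):
    """One arrow per consecutive pair of years, in time order."""
    ordered = sorted(records)  # tuples sort by their first field, the year
    return [
        {"from_year": y0, "to_year": y1, "start": (x0, p0), "end": (x1, p1)}
        for (y0, x0, p0), (y1, x1, p1) in zip(ordered, ordered[1:])
    ]


segments = arrow_segments(records)
for seg in segments:
    print(seg["from_year"], "->", seg["to_year"], seg["start"], seg["end"])
```

Each segment can then be drawn with the arrow primitive of a plotting library (for instance matplotlib's `Axes.annotate` with `arrowprops`), with the year placed next to each point as a label. In the hypothetical data the last arrow points left, which is exactly the kind of movement that connecting the points in order of GDP would hide.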
GDP and HIV prevalence both increased in Equatorial Guinea until 2005. Note that the arrows only indicate the direction of joint evolution, not a correlation. In particular, the arrows will be useful to explore the evolution after 2004.
HIV prevalence has increased steadily, except during 2008. GDP didn’t grow during 2005 and 2008-2010.
Given the geographical nature of the data, it lends itself to display in a choropleth map.
The choropleth map below displays HIV prevalence in Africa in 2010.
To see the evolution of HIV prevalence over time, we can animate the choropleth map, showing one frame per year.
For discussion: Discuss the merits and suitability of choropleth maps and compare them to the other visualisations discussed in this section.
The data source is ONS.
We need data on two categorical variables, at least. One is the source and the other one is the target. Then we need a quantitative variable with amount flowing from source to target.
The thickness of the curves is proportional to the value flowing from node 1 (Funding source) to node 2 (Funding target).
One can have many nodes in a Sankey chart (the above has only 2 nodes).
The bars at each node are usually not displayed in Sankey charts.
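The data layout described above can be sketched as a list of (source, target, amount) records, from which link thicknesses and node totals follow directly. The labels and amounts below are hypothetical placeholders, not ONS figures, and `sankey_layout` is an illustrative helper rather than a library function.

```python
from collections import defaultdict

# Hypothetical flows: (source, target, amount). Not ONS data.
flows = [
    ("Source 1", "Target A", 30.0),
    ("Source 1", "Target B", 10.0),
    ("Source 2", "Target A", 20.0),
    ("Source 2", "Target B", 40.0),
]


def sankey_layout(flows, total_height=50.0):
    """Link widths proportional to the amounts, plus totals per node."""
    scale = total_height / sum(amount for _, _, amount in flows)
    links = [
        {"source": s, "target": t, "width": amount * scale}
        for s, t, amount in flows
    ]
    source_totals = defaultdict(float)
    target_totals = defaultdict(float)
    for s, t, amount in flows:
        source_totals[s] += amount
        target_totals[t] += amount
    return links, dict(source_totals), dict(target_totals)


links, source_totals, target_totals = sankey_layout(flows)
for link in links:
    print(link["source"], "->", link["target"], "width", link["width"])
print("outgoing per source:", source_totals)
print("incoming per target:", target_totals)
```

The total leaving the sources equals the total arriving at the targets, which is why the stacked link widths have the same overall height on both sides of the chart.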
For discussion: What would you change in this plot? Tell a story from the above Sankey chart.
A waterfall chart illustrates how different quantitative elements contribute to a total. It breaks a net change down into the individual components that contribute to it and visualises each one separately.
In the next example we use some fictitious sales data.
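The arithmetic behind a waterfall chart is a running total: each bar starts where the previous one ended. The sketch below uses made-up figures (not the data behind the chart in this section), and `waterfall_bars` is a hypothetical helper name.

```python
# Made-up figures for illustration only.
changes = [
    ("Opening balance", 100),
    ("Product sales", 60),
    ("Services", 25),
    ("Refunds", -15),
    ("Costs", -40),
]


def waterfall_bars(changes):
    """Bottom and height of each floating bar, plus a final total bar."""
    bars = []
    running = 0
    for label, delta in changes:
        start = running
        running += delta
        bars.append({
            "label": label,
            "bottom": min(start, running),  # bars hang down for decreases
            "height": abs(delta),
            "end": running,
        })
    bars.append({"label": "Total", "bottom": min(0, running),
                 "height": abs(running), "end": running})
    return bars


bars = waterfall_bars(changes)
for bar in bars:
    print(f"{bar['label']:16s} bottom={bar['bottom']:4d} "
          f"height={bar['height']:4d} end={bar['end']:4d}")
```

The `bottom` and `height` values are what a bar-chart function needs in order to draw the floating bars; positive and negative contributions are usually given different colours.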
For discussion: How can you use waterfall charts for other applications besides sales?
Here we also view how a part contributes to a total.
Data from NHS (up to Nov 2021)
It’s more striking to view the data as a streamplot. A streamplot is like an area plot, except it is symmetric around zero.
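The difference between the two layouts comes down to the baseline on which the layers are stacked. The following sketch uses hypothetical layer values (not the NHS data) and a hypothetical helper, `stacked_bounds`, to compute the lower and upper edge of each layer for both cases.

```python
# Hypothetical layer values at four time points. Not the NHS data.
layers = {
    "Group 1": [2, 4, 6, 3],
    "Group 2": [1, 3, 5, 2],
    "Group 3": [1, 1, 3, 1],
}


def stacked_bounds(layers, symmetric):
    """Lower and upper edge of every layer.

    symmetric=False stacks from zero (area plot); symmetric=True starts at
    minus half the total, so the stack is centred on zero (streamplot).
    """
    n_points = len(next(iter(layers.values())))
    totals = [sum(series[t] for series in layers.values())
              for t in range(n_points)]
    lower = [-total / 2 if symmetric else 0.0 for total in totals]
    bounds = {}
    for name, series in layers.items():
        upper = [lo + value for lo, value in zip(lower, series)]
        bounds[name] = (lower, upper)
        lower = upper  # the next layer sits on top of this one
    return bounds


area = stacked_bounds(layers, symmetric=False)
stream = stacked_bounds(layers, symmetric=True)
print("area plot, top edge:   ", area["Group 3"][1])
print("streamplot, bottom edge:", stream["Group 1"][0])
print("streamplot, top edge:   ", stream["Group 3"][1])
```

In both layouts the thickness of each layer is the same; only the baseline moves. Plotting libraries often provide this directly (matplotlib's `stackplot`, for example, accepts a `baseline` argument).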
For discussion: Discuss the differences between an area plot and a streamplot. When would you use one or the other?